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Abstract 

To maximize efficiency in time and space, allocations and dealloca- 
tions, in the exact linear algebra library LinBox, must always occur in 
the founding scope. This provides a simple lightweight allocation model. 
We present this model and its usage for the rebinding of matrices between 
different coefficient domains. We also present automatic tools to speed-up 
the compilation of template libraries and a software abstraction layer for 
the introduction of transparent parallelism at the algorithmic level. 

1 Introduction 

As a building block for a wide range of applications, computational exact 
linear algebra has to conciliate efficiency and genericity. The goal of the 
LinBox project is to address this problem in the design of an efficient 
general-purpose C++ open-source library for exact linear algebra over the 
integers, the rationals, and finite fields. Matrices can be either dense, 
sparse or black box (i.e. viewed as a linear operator, acting on vectors 
only). The library proposes a set of high level linear algebra solutions, 
such as the rank, the determinant, the solution of a linear system, the 
Smith normal form, the echelon form, the characteristic polynomial, etc. 
Each of these solutions involves a hybrid combination of several specialized 
algorithms depending on the domain, and the type of matrix. Over a finite 
field, the building blocks are an efficient implementation of Wiedemann 
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and block Wiedemann algorithms combined with preconditioners [T] for 
black box matrices, a sparse Gaussian elimination for sparse matrices and 
the BLAS based dense linear algebra techniques of the FFLAS library [3] 
for dense matrices. The solutions over the integers and rationals are lifted 
from modular computations by a Chinese remainder algorithm or p-adic 
lifting. The design is based on high genericity to allow us to write effi- 
cient algorithms independent of the many representations of domains and 
matrices. As a middleware, the library relies on the efficiency of kernel 
libraries such as GMpQ, Givaro 4 , NTL 4 , ATLAS 4 and can be used by general 
purpose computer algebra systems such as Sage 4 or Maple 4 . 

We describe in this paper a selection of ideas and improvements that 
were recently introduced into the the design of LinBox for the forthcoming 

2.0 release. 

2 The lightweight founding scope alloca- 
tion model 

The main objects that require memory allocation in LinBox are base 
field or ring elements, vectors, matrices, and polynomials. The memory 
management for all of these object types follows the same rules, organized 
to maximize efficiency in time and space, and consequently requiring some 
efforts by the programmer: the allocations and deallocations must always 
occur in the founding scope. In particular no external garbage collection 
mechanism is used. 

2.1 Call- by-reference 

The input and output types of most functions are usually template types, 
and can be either basic types, or complicated objects. Consequently, 
passing arguments by value (copy) must be avoided as much as possible. 
Every argument is passed as a reference, including the return types. More 
precisely the return value of a function is also the first argument, defined 
as a non const reference. 

Matrixft someFunction(Matrix& result, const XXX& args) ; 

This convention was already presented in [5] §2.1] for the design of field 
and ring arithmetic. It does require a redefinition of the interface for some 
stl-like operators, as discussed in section ETUI A consequence of the above 
convention is that the objects returned by a function, have to be declared 
and initialized (in particular, memory allocated, e.g. via constructors) 
before the function call. By enforcing this practice, we require that the 
programmer keep the handle on the objects that he allocates until all uses 
of the object and its subobjects are completed. Moreover, he is respon- 
sible for object deallocation in the same scope where it was allocated. 
This restricts some convenient programming practices, but provides pre- 
cise control of memory usage. This is particularly important when large, 
memory filling, matrices are in play. It also allows to avoid the costs of 

1 gmplib. org, www-ljk . imag . fr/CASYS/LQGICIELS/givaro, www. shoup.net/ntl , math- atlas . sourcef orge .net , sagemath. org, * 
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garbage collection or reference counting. Many LinBox objects involve a 
handle containing a reference to the free store. Note that even though a 
function does not allocate the handle itself, it is in certain cases still free 
to resize and thus reallocate the free store memory referenced. 

Dense Matrix allocations. The objects storing dense matrices re- 
quire a special care concerning their allocations. Dense matrices are rep- 
resented as a one dimensional array storing elements in the row major 
format: A[i,j] = *(A+i*n+j). It is important to be able to define a 
sub-matrix as a view on such an array, without allocating the data. For 
this we propose to distinguish two classes: one for allocated (via construc- 
tors) matrices and the other for sub-matrix views. The genericity of the 
template mechanism or inheritance will allow to use these two types in 
the same code, without duplication. This allows also for an automatic 
decision about deallocation. Other solutions includes reference counting 
and explicit "end of use" functions. 

Thus a first approach is to define a dense matrix class with a boolean 
_alloc member, telling whether the matrix owns its data or whether it 
is a simple view on some other matrix's data. The destructor deallocates 
the data only if _alloc is true. This can be viewed as a simplified ref- 
erence counting mechanism, where one assumes that the matrix initially 
allocated is always destructed after all of its sub-matrices. This conven- 
tion is consistent with the previous consideration: the allocations and 
deallocations must always occur in the founding scope. 

To further improve the efficiency, an alternative is to distinguish two 
classes: one for allocated matrices and the other for sub-matrix views. 
The genericity of the template mechanism or inheritance will allow to 
use these two types in the same code, without duplication. Furthermore, 
thread-safety mechanism on the _alloc member are not required anymore. 

Remark that in this founding scope model, neither the alloc vari- 
able nor a two classes system is required. The programmer should know 
whether a matrix is created as a sub-matrix or as an allocating instance 
by what constructors or other initializers she uses. Thus she knows which 
require care to deallocate in the same scope. What she does not necessar- 
ily get is automatic decision about deletion in the destructor, and would 
thus have to call an explicit "end of use" function. 

2.2 Rebind of coefficient domains 
2.2.1 Mapping of data between domains 

LinBox makes use of the concept of rebinds for the mapping of data struc- 
tures between different coefficient domains. For instance, in the context of 
the Chinese remainder algorithm, rebinds allow to map a matrix over the 
integers of type, say, (DenseMatrix<PID_Integer>) to a modular matrix 
of type, say, 

(DenseMatrix<Modular<double> >). 

In LinBox, binder adaptors are enclosed within many data struc- 
tures and make use of a generic converter, named Horn and found in 
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linbox/f ield/hom.h Horn can generically use the LinBox domain's canon- 
ical conversion methods methods init and convert from/to the LinBox 
Integer type: Domainl — » Integer — > Domain2. Moreover, when natural, 
efficient conversions exists between domains (e.g. different representations 
of the same field or one ring embedded in another), generic Horn can be 
directly bypassed by a specialization of Horn. 

Rebind of dense matrices. We illustrate the founding scope allo- 
cation model with the use of rebind functions adapted from the allocators 
in the STL, on dense matrices. 

template <class Domain> class DenseMatrix ■[ 
typedef DenseMatrix<Domain> Self_t; 

template-Cclass AnyDomain> struct rebind{ 
typedef DenseMatrix <AnyDomain> other; 

operator () (otherfc Ap, const Self_t& A, const AnyDomainft D){ 
// Performs the modular conversion of A to Ap over D 
typename Self _t : : ConstRawIterator A_iter; 
typename other : :RawIterator Ap_iter; 
Hom<Field, _Tpl> hom(A. f ield() , F) ; 

for (A_iter = A. rawBeginO , Ap_iter = Ap . rawBeginO ; 
A_iter != A. rawEndO ; ++ A_iter, ++ Ap_iter) 
horn. image (*Ap_iter, *A_iter) ; 

} 

}; 

} 

According to the founding scope allocation model, the function operator () 
in charge of the initialization of the matrix cannot allocate any memory. 
This has to be done at the level where the rebind is called. This also 
requires a modification of the rebind operator interface of the STL: the 
new object is passed by reference. 

2.2.2 Rebind of handlers in the founding scope allocation 
model 

In the case of BlackBoxes (functions providing only a matrix- vector prod- 
uct and not necessarily storing any data) the rebind mechanism becomes 
more specific. We detail in this section the solution provided in blackboxes 
which only store references to other blackboxes, such as the Compose, 
Transpose, Submatrix, etc. 

Indeed to rebind a blackbox containing only references one should al- 
locate a new memory zone and rebind the refered blackbox there. The 
problem is that a caller, given a BlackBox : : rebind<Field2> : : other type, 
does not necessarily know how to allocate for this object. The STL solu- 
tion is to embed the allocator in each container. In LinBox, we propose 
another solution: the other not only has different elements, but also can 
be of a different type. For instance, the rebind other type of a blackbox 
containing a reference will be the same kind of blackbox, but physically 
storing the data (and thus owning it). 
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For the different blackboxes defined in LinBox which use references, 
we thus define a similar class called e.g. TransposeOwner, ComposeOwner, 
SubmatrixOwner, etc. These classes store and own their data. Then it 
suffices for the rebind sub-class of their reference equivalents to define its 
other type to the associated *0wner class. The example of the Compose 
is given in figure [T] 

Not this also fits well in the LinBox founding scope allocation, since 
the *0wner class will be declared (and thus allocated) by the caller of the 
rebind in codes similar to the following: 

template<class BlackBox> void f (const BlackBoxft A) { 

typedef typename Blackbox: : template rebind<Field2> :: other FBlackbox; 
// rebinds generically the BlackBox A to a BlackBox Ap 
// with a new Domain F2 

// The container type of Ap might be different from the one of A 
// this decision is made in the rebind type of A, 
// via the 'other' typedef 
FBlackbox Ap(A, F2) ; 

} 

Remark that for a submatrix of a class storing its elements (contrary 
to a submatrix of a blackbox containing e.g. only references), a more 
efficient rebind would only rebind the elements within the boundaries 
of the submatrix. There we use a trait to decide wether the refered 
blackbox is a storing component and in the latter case specialize e.g. 
Submatrix<DenseMatrix<Fieldl> >: : rebind<Field2> : : other to a sim- 
pler DenseMatrix<Field2> instead of using the *0wner mechanism. 

3 Software abstraction layer for parallelism 

Efficient parallel applications must take into consideration hardware char- 
acteristics (number of cores, memory hierarchy, etc.). It is time consum- 
ing or impossible for a single developer to program a high performance 
computer algebra application, with state of the art algorithms, while ex- 
ploiting all the available parallelism. In order to separate the domains 
of expertise we have designed a software abstraction layer between com- 
puter algebra algorithms and parallel implementations which may employ 
automatic dynamic scheduling. 

3.1 Parallel building blocks 

Computer algebra algorithms have three main characteristics: 1) they are 
complex and require a deep knowledge of the problem in order to obtain 
the most efficient sequential algorithm; 2) they may be highly irregular. 
This enforces a runtime use of load balancing algorithms; 3) they are 
generic in the sense that they are usually designed to work over several 
algebraic domains. 
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template <class _Blackboxl, class _Blackbox2> class Compose { 



template<typename _Tpl, typename _Tp2 = _Tpl> struct rebind { 
typedef ComposeOwner< 

typename Blackboxl :: template rebind<_Tpl> :: other , 
typename Blackbox2 :: template rebind<_Tp2> :: other 

> other; 

>; 

const Blackboxl * _A_ptr; 
const Blackbox2 * _B_ptr; 

>; 

template <class _Blackboxl, class _Blackbox2> class ComposeOwner { 

template<typename _Tpl , typename _Tp2 = _Tpl> struct rebind { 
typedef ComposeOwner< 

typename Blackboxl :: template rebind<_Tpl> :: other , 
typename Blackbox2 :: template rebind<_Tp2> :: other 

> other; 

>; 

template-ctypename _BBtl, typename _BBt2, typename Field> 
ComposeOwner (const Compose<_BBtl , _BBt2> &M, const Fieldfe F) 

: _A_data(*(M.getLeftPtr()) , F) , _B_dataO(M.getRightPtr()) , F) { 

typename Compose<_BBtl , _BBt2> :: template rebind<Field>() (*this,M,F) ; 

> 

template<typename _BBtl, typename _BBt2, typename Field> 
ComposeOwner (const ComposeOwner<_BBtl , _BBt2> &M, const Fieldfe F) 
: _A_data(M.getLeftData() , F) , _B_data(M.getRightData() , F) { 

typename ComposeOwner<_BBtl , _BBt2> :: template rebind<Field> () (*this,M,F) ; 

> 

Blackboxl _A_data; 
Blackbox2 _B_data; 

>; 



Figure 1: Owner mechanism for the composed blackboxes 
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In the case of LinBox algorithms, we have decided to base our soft- 
ware abstraction, called Parallel Building Blocks (PBB), on the STL algo- 
rithms (Standard Template Like) principles. Indeed, C++ data structures 
in LinBox let us have random access iterators over containers which are 
naturally parallel. We have already defined several STL-like algorithms 
and the list will be extended in the near future: 

for_each, transform, accumulate the PBB versions of these algo- 
rithms are similar to the STL versions except that the involved operators 
(or function object classes), given as parameters, are required to have their 
return value reference passed as the first parameter of the function. This 
is in accordance with the memory model of LinBox. The STL return-by- 
value semantic is not appropriate. 

The fundamental idea of PBB is that at the computer algebra level, the 
parallelization of all the loops and more generally of all the STL-like al- 
gorithms will already enable good performance and easy switching among 
multiple implementations. Regarding performance, this parallelization of 
the inner loops of the underlying linear algebra is sufficient in many cases. 
Regarding implementations, this abstraction provides for programming in- 
dependent of the parallel model with selection of the parallel environment 
depending on the target architecture. The parallel blocks can be imple- 
mented using many different parallel environments, such as OpenMP[3; 
TBB 7 (Thread Building Blocks) or Kaapi [6]; using both static and dy- 
namic work-stealing schedulers [5] . The current implementations are built 
on OpenMP and Kaapi. 

3.2 Accumulate_until and early termination 

To bound the complexity of many linear algebra problems, one of the key 
ideas is to use an accumulation with early termination. 

For instance, this is used in Chinese Remaindering algorithms. The 
computation is performed modulo a sequence of (co)prime numbers and 
the result is built from a sequence of residues, until a condition is satis- 
fied [3]. The termination of the algorithm depends on the accumulated 
result. 

In order to parallelize such algorithms, we proposed an extension of 
the STL algorithms called accumulate_until . The algorithm takes an 
array v of length N, a unary operator / to be applied to each array en- 
try and a specific binary update operator/predicate for the accumulation. 
This accumulator with a signature like bool accum(a, b) behaves like an 
in place addition (a+=b) and returns true to indicate sufficiently many 
values are accumulated. Let Sk = X^i=o fe with k € {0, N}. The 

algorithm computes and returns n < N and S„ such that one accumula- 
tion during the computation of S„ returned true or n — N. In intended 
use, we know any additional accumulation would also return true. 

This algorithm will be used for the early termination Chinese remain- 
dering algorithms of LinBox. Though not yet using PBB and accumu- 
late_until , a sequential version and parallel versions with OpenMP and 

www. sgi . com/tech/stl 

openmp . org threadingbuildingblocks . org 



7 



Kaapi can be found in the LinBox distributions as linbox/algorithms/ era-domain-* .h 



3.3 Memory contention and thread safe alloca- 
tion 

Many computer algebra programs allocate dynamic memory for the in- 
termediate computations. Several experiments with LinBox algorithms 
on multicore architectures have shown that these allocations are quite of- 
ten the bottleneck. An analysis of the memory pattern and experiments 
with three well known memory allocators (ptmalloc, Hoard and TCMal- 
loc from Google Perf. Tools [7]) have been conducted. The goal was to 
decide whether the parallel building blocks model was suitable to high- 
performance exact linear algebra. We used dynamic libraries to exchange 
allocators for the experiments, but one can use them together in the Lin- 
Box library if needed [8] §7]. Preliminary experiments on early terminated 
Chinese remaindering, not the easiest to parallelize, have demonstrated 
the advantage, in our setting, of TCMalloc over the others [3]. One of 
the main reasons for that fact is that our problems required many tem- 
porary allocations. This fits precisely the thread safe caching mechanism 
of TCMalloc. 

4 Automated Generic Separate compila- 
tion 

LinBox is developed with several kinds of genericity: 1) genericity with 
respect to the domain of the coefficients, 2) genericity with respect to the 
data structure of the matrices, 3) genericity with respect to the interme- 
diate algorithms. While this is efficient in terms of capabilities and code 
reusability, there is a combinatorial explosion of combinations. Consider 
that each of 50 arithmetic domains may be combined with each of 50 
matrix representations in each of 10 intermediate algorithm forms for a 
single problem as simple as matrix rank. This lengthens the compilation 
time and generates large executable files. 

For the management of code bloat LinBox has used an "archetype 
mechanism" which enables, at the user's option, to switch to a compi- 
lation against abstract classes [2l §2.1]. However, this can reduce the 
efficiency of the library. Therefore, we propose here a way to provide a 
generic separate compilation. This will not deal with code bloat, but will 
reduce the compilation time while preserving high performance. This is 
useful for instance when the library is used with unspecialized calls. This 
is largely the case for some interface wrappers to other Computer algebra 
systems such as Sage or Maple. Our idea is to automate the technique 
of [5] which combines compile-time instantiation and link-time instanti- 
ation, while using template instantiation instead of void pointers. The 
mechanism we propose is independent of the desired generic method, the 
candidate for separate compilation, and is explained in algorithm [T] 

This Algorithm is illustrated on figure[5] where the function is the rank 
and the template parameter is a dense matrix over GF(2), DenseMatrix<GF2>. 



Algorithm 1 C++ Automatic separate compilation wrapping 
Input: A generic function func. 

Input: Template parameters TParam for separate specialization/compilation of 
func. 

Output: A generic function calling func with separately compiled instantia- 
tions. 

1: Create a header and a body files "func_instantiate.hpp" and 

"func_instantiate.cpp" ; 
2: Add a template function f unc_separate, with the same specification as 

func, to the header; 
3: Its generic default implementation is a single line calling the original function 

func. 

{This enables to have a unified interface, even for non specialized class.} 
4: for each separately compiled template parameter TParam do 
5: Add a non template specification f uncTParam, to the header file; 
6: Add the associated body with a single line returning the instantiation of 

func on a parameter of type TParam, to the body file; 
7: Add an inline specialization body of f unc_separate on a parameter of 

type TParam with a single line returning f uncTParam, to the header file; 
8: end for 

9: Compile the body file "func_instantiate.cpp" . 



'template<clas3 Mat> V Default call 

rank(const Mat&) J 


f \ 

tempi at e<class Mat> 

, rank_separate(const Mat&) 


stantiate j 




h f ^ Specialized call 

3 [rankDenseMatGF2 p 1 


tempi at e<> 

^rank_separate(const DenseMat<GF2>&)J 







rankDenseMatGF2 



Separately compiled body 

Figure 2: Separate compilation of the rank 

Algorithm Q] has been simplified for the sake of clarity. To enable 
a more user-friendly interface one can rename the original function and 
all its original specializations f unc_original; then rename also the new 
interface simply func. With the classical inline compiler optimizations, 
the overhead of calling f unc_separate is limited to single supplementary 
function call. Indeed all the one line additional methods will be automat- 
ically inlined, except, of course, the one calling the separately compiled 
code. If this overhead is too expensive, it suffices to enclose all the non 
generic specializations of "funcjnstantiate.hpp" by a macro test. At com- 
pile time, the decision to separately compile or not can be taken according 
to the definition of this macro. 



We show in tables [TJ and [2] the gains in compilation time obtained on 
two examples from LinBox: the examples/{rank, solve} .C algorithms. 
Indeed, without any specification the code has to invoke several special- 
izations depending on run-time discovered properties of the input. For 
instance solve . C requires 6 specializations for sparse matrices over the 
Integers or over a prime field, with a sparse elimination, or an iterative 
method, or a dense method, if the matrix is small. . . 
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Table 1: linbox/examples/ {rank, solve}. C compilation time on an AMD Athlon 
3600+, 1.9GHz, with gec 4.5 -02. instantiate . o contains to the separately 
compiled instantiations (e.g. densegf2rank in figure [5]); {rank, solve} . o con- 
tains to the user interface and generic implementation compilation; link corre- 
sponds to the linking of both . o and the library; Full comp . corresponds to 
the compilation without the separate mechanism. 
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5.0 


5.0 


5.1 
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Table 2: linbox/examples/ {rank, solve}. C compilation time on an intel Xeon 
E5345, 2.33GHz, with icc-11.1 -02. 



Acknowledgment 

We thank the LinBox group and especially Brice Boyer, Pascal Giorgi, 
Erich Kaltofen, Dan Roche, Brian Youse for many useful discussions in 
particular during the recent LinBox developer meetings in Delaware and 
Dublin. 



10 



References 



[1] L. Chen, W. Eberly, E. Kaltofen, B. D. Saunders, W. J. Turner, and 
G. Villard. Efficient matrix preconditioners for black box linear alge- 
bra. Linear Algebra and its Applications, 343-344:119-146, 2002. 

[2] J.-G. Dumas, T. Gautier, M. Giesbrecht, P. Giorgi, B. Hovinen, 
E. Kaltofen, B. D. Saunders, W. J. Turner, and G. Villard. Lin- 
Box: A generic library for exact linear algebra. In A. M. Cohen, 
X.-S. Gao, and N. Takayama, editors, Proceedings of the 2002 Inter- 
national Congress of Mathematical Software, Beijing, China, pages 
40-50. World Scientific Pub., Aug. 2002. 

[3] J.-G. Dumas, T. Gautier, and J.-L. Roch. Generic design of Chinese 
remaindering schemes. In M. Moreno-Maza and J.-L. Roch, editors, 
PASCO 2010. Universite de Grenoble, France, July 2010. 

[4] J.-G. Dumas, P. Giorgi, and C. Pernet. Dense linear algebra over 
word-size prime fields: the fflas and ffpack packages. ACM Trans. 
Math. Softw., 35(3):l-42, 2008. 

[5] U. Erlingsson, E. Kaltofen, and D. Musser. Generic Gram-Schmidt 
orthogonalization by exact division. In ISSAC'1996, pages 275-282, 
July 1996. 

[6] T. Gautier, X. Besseron, and L. Pigeon. KAAPI: a thread schedul- 
ing runtime system for data flow computations on cluster of multi- 
processors. In PASCO'07, pages 15-23, 2007. 

[7] S. Ghemawat and P. Menage. Tcmalloc: Thread-caching malloc. 
http : //goog-perf tools . sourcef orge .net /doc/tcmalloc .html 

[8] E. Kaltofen, D. Morozov, and G. Yuhasz. Generic matrix multipli- 
cation and memory management in LinBox. In ISSAC'2005, pages 
216-223, July 2005. 

[9] D. Traore, J. L. Roch, N. Maillard, T. Gautier, and J. Bernard. Deque- 
free work-optimal parallel STL algorithms. In EUROPAR 2008, pages 
887-897, Las Palmas, Spain, Aug. 2008. 



11 



